Course: VISUAL ANALYTICS FOR POLICY AND MANAGEMENT

Prof. José Manuel Magallanes, PhD

  • Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
  • Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú.

Tabular data: Multivariate data


We collect multiple variables for a particular purpose, knowing that social complexity can hardly be explained with univariate or bivariate approaches alone. Since information in high-dimensional data is difficult to visualize directly, you should consider summarising all these variables into a simpler form.

Let me start from the simplest.


This time, I will use the data about city safety:

# clean memory
rm(list = ls()) 

# location of the data
link="https://github.com/EvansDataScience/data/raw/master/safeCitiesIndexAll.xlsx"

# 'rio' can be used to import EXCEL files:
library(rio)
safe=import(link)

These variables inform us about the safety levels of some cities in the world, and are grouped into the Digital (D), Health (H), Infrastructure (I), and Personal (P) dimensions. For each of these dimensions, there are measures of actions taken (In) and results obtained (Out). We have 49 variables.

Would making a plot of 49 variables be a good idea? A heatmap is the usual answer.

Making a heatmap in ggplot2 requires the data in a particular shape, known as the long format.

The data frame safe is in wide format, that is, one observation per row (the most usual format):

head(safe)
##        city D_In_PrivacyPolicy D_In_AwarenessDigitalThreats
## 1 Abu Dhabi                 50                     66.66667
## 2 Amsterdam                100                    100.00000
## 3    Athens                 75                    100.00000
## 4   Bangkok                 25                     66.66667
## 5 Barcelona                100                    100.00000
## 6   Beijing                 75                     66.66667
##   D_In_PubPrivPartnerships D_In_TechnologyEmployed D_In_CyberSecurity
## 1                       50                     100                 50
## 2                       50                     100                 50
## 3                        0                      75                 50
## 4                        0                       0                 50
## 5                       50                     100                 50
## 6                        0                      75                100
##   D_Out_IdentityTheft D_Out_CompInfected D_Out_InternetAccess
## 1           100.00000                 25             97.64526
## 2            99.74426                 75            100.00000
## 3           100.00000                 25             66.62939
## 4            99.99770                 50             31.66104
## 5            99.95572                 25             81.69309
## 6            99.99868                 25             77.25884
##   H_In_EnvironmentPolicies H_In_AccessHealthcare H_In_Beds_1000
## 1                 62.79070              73.33333       8.265146
## 2                 97.67442             100.00000      38.019671
## 3                 41.86047              73.33333      38.846186
## 4                 72.09302              80.00000      16.530292
## 5                 81.39535              93.33333      24.795438
## 6                 79.06977              86.66667      47.899860
##   H_In_Doctors_1000 H_In_AccessFood H_In_QualityHealthServ H_Out_AirQuality
## 1         16.365171            99.7                   75.0         44.21882
## 2         38.000482           100.0                  100.0         84.44259
## 3         73.010130           100.0                   62.5         85.00159
## 4          2.327545            98.2                   62.5         76.48108
## 5         43.632417           100.0                  100.0         84.74936
## 6         44.609262            96.2                   62.5         14.76146
##   H_Out_WaterQuality H_Out_LifeExpectY H_Out_InfMortality H_Out_CancerMortality
## 1           91.34471          74.89746           93.15708              83.41277
## 2           94.49802          90.42099           97.35614              17.10638
## 3           87.64892          89.97567           96.73406              45.94894
## 4           79.33302          63.94232           86.00311              62.23830
## 5           78.28508          96.65553           96.88958              42.75745
## 6           85.95152          69.10362           88.02488              36.76596
##   H_Out_AttacksBioChemRad I_In_EnforceTransportSafety
## 1                     100                       100.0
## 2                     100                        70.0
## 3                     100                        60.0
## 4                     100                        52.5
## 5                     100                        82.5
## 6                     100                        77.5
##   I_In_PedestrianFriendliness I_In_QualityRoad I_In_QualityElectricity
## 1                         100              100                      75
## 2                         100              100                     100
## 3                         100               50                      75
## 4                          50               25                     100
## 5                         100              100                     100
## 6                          50               75                     100
##   I_In_DisasterManagement I_Out_DeathsDisaster I_Out_VehicularAccidents
## 1                      75            100.00000                 95.69134
## 2                     100             99.73027                 96.56438
## 3                      75             98.35766                 97.16392
## 4                      75             97.16755                 90.55944
## 5                     100             99.41487                 90.18464
## 6                      50             96.35063                 99.65628
##   I_Out_PedestrianDeath I_Out_LiveSlums I_Out_AttacksInfrastructure
## 1              67.95027       100.00000                   100.00000
## 2             100.00000        94.19238                   100.00000
## 3              84.87354        88.56624                    91.48936
## 4              68.25461        54.62795                    70.21277
## 5              93.83342       100.00000                   100.00000
## 6              42.12348        54.26497                   100.00000
##   P_In_PoliceEngage P_In_CommunityPatrol P_In_StreetCrimeData P_In_TechForCrime
## 1               100                  100                  100                 0
## 2               100                  100                  100               100
## 3                50                    0                  100               100
## 4                50                  100                    0               100
## 5               100                  100                  100               100
## 6                50                  100                  100               100
##   P_In_PrivateSecurity P_In_GunRegulation P_In_PoliticalStability
## 1                  100           47.61905                      55
## 2                  100           68.25397                      80
## 3                  100           31.74603                      70
## 4                  100           41.26984                      40
## 5                  100           66.66667                      70
## 6                  100          100.00000                      45
##   P_Out_PettyCrime P_Out_ViolentCrime P_Out_OrganisedCrime P_Out_Corruption
## 1               75                100                   75               66
## 2               50                 75                  100               83
## 3               75                100                   50               44
## 4               50                 75                   50               35
## 5               75                 75                   75               58
## 6               50                 75                   50               40
##   P_Out_DrugUse P_Out_TerroristAttacks P_Out_SeverityTerrorist
## 1      86.75869               99.85724                99.98014
## 2      64.62168               99.92862               100.00000
## 3      92.51022               79.65739                99.28486
## 4      92.43354               89.50749                88.55781
## 5      65.18405              100.00000               100.00000
## 6     100.00000               99.78587                98.64919
##   P_Out_GenderSafety P_Out_PerceptionSafety P_Out_ThreaTerrorism
## 1           96.11598                  84.49                   75
## 2           96.88446                  67.40                   75
## 3           92.62419                  49.28                   75
## 4           87.88981                  51.44                   25
## 5           96.81837                  60.35                   75
## 6           96.55779                  58.35                  100
##   P_Out_ThreatMilitaryConf P_Out_ThreatCivUnrest
## 1                      100                    75
## 2                      100                    75
## 3                       75                    50
## 4                       50                    25
## 5                      100                    75
## 6                       75                    75

This is how we produce a long format:

library(reshape2)

safeL=melt(safe, # all the data
           id.vars = 'city') # unique id per row
head(safeL)
##        city           variable value
## 1 Abu Dhabi D_In_PrivacyPolicy    50
## 2 Amsterdam D_In_PrivacyPolicy   100
## 3    Athens D_In_PrivacyPolicy    75
## 4   Bangkok D_In_PrivacyPolicy    25
## 5 Barcelona D_In_PrivacyPolicy   100
## 6   Beijing D_In_PrivacyPolicy    75

The melt function reshaped the data: the variable names were sent into rows, with their values to the right.
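The inverse operation, long back to wide, is reshape2::dcast(). A minimal sketch on a toy data frame (the names wide and long are illustrative, not from the safety data):

```r
library(reshape2)

# toy wide data frame: one row per city
wide = data.frame(city = c("Lima", "Quito"),
                  score_A = c(10, 20),
                  score_B = c(30, 40))

# wide -> long: variable names move into rows
long = melt(wide, id.vars = "city")

# long -> wide: dcast() undoes melt()
wide_again = dcast(long, city ~ variable)
```

dcast() infers the value column (here value) automatically; pass value.var explicitly if your long data names it differently.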

Now, the heatmap using this format:

library(ggplot2)

base = ggplot(data = safeL, 
              aes(x = variable,
                  y =city)) 

heat1= base +  geom_tile(aes(fill = value)) 
heat1

Here you can see which cities have higher or lower values on which sets of variables. You can also apply a color palette:

# grey gradient: darker means higher
heat1_recolor = heat1 +
                scale_fill_gradient(low = 'grey90',
                                    high = "grey30")
heat1_recolor

The column and row names need some work:

heat1_axisGood= heat1_recolor + 
                theme(axis.text.x = element_text(angle = 90, 
                                         hjust = 1,
                                         size = 4),
              axis.text.y = element_text(size = 4))

heat1_axisGood

So far, nothing relevant stands out. Let’s reorder the cities:

base= ggplot(data = safeL, aes(x = variable,
                               y =reorder(city,
                                          value, median)))
# same layers as before
base + geom_tile(aes(fill = value)) + 
    scale_fill_gradient(low = 'grey90',high = "grey50") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 4),
              axis.text.y = element_text(size = 4))
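reorder() itself is base R: it re-levels a factor by applying a summary function (here the median) to a second variable. A quick illustration with made-up values:

```r
grp = c("a", "a", "b", "b")
val = c(10, 20, 1, 2)

# re-level 'grp' so the levels are sorted by the median of 'val'
f = reorder(grp, val, FUN = median)
levels(f)  # "b" first: median(1, 2) = 1.5 is below median(10, 20) = 15
```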

As nothing better emerges, let’s also reorder the variables:

base= ggplot(data = safeL, aes(x = reorder(variable, 
                                           value, median),
                               y =reorder(city,
                                          value, median)))
base + geom_tile(aes(fill = value)) + 
    scale_fill_gradient(low = 'grey90',high = "grey50") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 4),
              axis.text.y = element_text(size = 4))

This is still hard to read. An alternative is to average the variables within each dimension. Let me show you this example:

  1. Variables you need to average: columns with the text ‘D_In_’:
library(magrittr)
#notice I do not need the 'city':
safe[,c(grep("D_In_", names(safe) ))]%>%head()
##   D_In_PrivacyPolicy D_In_AwarenessDigitalThreats D_In_PubPrivPartnerships
## 1                 50                     66.66667                       50
## 2                100                    100.00000                       50
## 3                 75                    100.00000                        0
## 4                 25                     66.66667                        0
## 5                100                    100.00000                       50
## 6                 75                     66.66667                        0
##   D_In_TechnologyEmployed D_In_CyberSecurity
## 1                     100                 50
## 2                     100                 50
## 3                      75                 50
## 4                       0                 50
## 5                     100                 50
## 6                      75                100
  2. Compute the average per row:
#averaging only numbers, as 'city' is not in data:
apply(safe[,c(grep("D_In_",names(safe) ))], #data
      1, #by row
      mean)%>% #function to apply by row to data
         round(2) # round the result
##      1      2      3      4      5      6      7      8      9     10     11 
##  63.33  80.00  60.00  28.33  80.00  63.33  41.67  80.00  66.67  26.67  46.67 
##     12     13     14     15     16     17     18     19     20     21     22 
##  46.67 100.00 100.00  63.33  26.67  58.33  80.00  21.67 100.00  60.00  28.33 
##     23     24     25     26     27     28     29     30     31     32     33 
##  58.33  58.33  35.00  58.33  41.67  58.33  80.00 100.00  80.00  31.67  90.00 
##     34     35     36     37     38     39     40     41     42     43     44 
##  58.33  83.33  41.67  63.33 100.00  78.33  73.33  48.33  43.33  48.33  80.00 
##     45     46     47     48     49     50     51     52     53     54     55 
## 100.00  58.33  43.33  66.67  53.33  95.00  75.00  90.00  88.33  16.67  88.33 
##     56     57     58     59     60 
##  90.00  90.00  68.33  16.67  75.00
  3. Create that column:
safe$DIN=apply(safe[,c(grep("D_In_",names(safe) ))],1,mean)%>%round(2)

Now let’s do the rest:

safe$DOUT=apply(safe[,c(grep("D_Out_",names(safe) ))],1,mean)%>%round(2)

safe$HIN=apply(safe[,c(grep("H_In_",names(safe) ))],1,mean)%>%round(2)

safe$HOUT=apply(safe[,c(grep("H_Out_", names(safe) ))],1,mean)%>%round(2)

safe$IIN=apply(safe[,c(grep("I_In_", names(safe) ))],1,mean)%>%round(2)

safe$IOUT=apply(safe[,c(grep("I_Out_", names(safe) ))],1,mean)%>%round(2)

safe$PIN=apply(safe[,c(grep("P_In_", names(safe) ))],1,mean)%>%round(2)

safe$POUT=apply(safe[,c(grep("P_Out_", names(safe) ))],1,mean)%>%round(2)
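The apply(..., 1, mean) pattern above also has a vectorized base R equivalent, rowMeans(), which is faster on large data. A small sketch on a toy data frame (the column names are illustrative):

```r
toy = data.frame(D_In_a  = c(50, 100),
                 D_In_b  = c(70, 80),
                 D_Out_a = c(0, 10))

# select the 'D_In_' columns by name, then average each row
cols = grep("D_In_", names(toy))
round(rowMeans(toy[, cols]), 2)  # same result as apply(toy[, cols], 1, mean)
```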

Let’s subset the previous data frame so that a new one has all the Input variables:

safeINS=safe[,c(1,grep("IN$", colnames(safe)))] # '$' for 'end with'.
head(safeINS)
##        city   DIN   HIN  IIN   PIN
## 1 Abu Dhabi 63.33 55.91 90.0 71.80
## 2 Amsterdam 80.00 78.95 94.0 92.61
## 3    Athens 60.00 64.93 72.0 64.54
## 4   Bangkok 28.33 55.28 60.5 61.61
## 5 Barcelona 80.00 73.86 96.5 90.95
## 6   Beijing 63.33 69.49 70.5 85.00
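The '$' anchor in that pattern matters: without it, grep() also matches names that merely contain "IN". A quick base R check on a toy vector of names:

```r
nms = c("city", "DIN", "HIN", "D_In_PrivacyPolicy", "INDEX")

grep("IN$", nms, value = TRUE)  # ends with "IN": "DIN" "HIN"
grep("IN",  nms, value = TRUE)  # contains "IN": "DIN" "HIN" "INDEX"
```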

Let’s rename the vars in this data set:

names(safeINS)=c("city",'DIGITAL','HEALTH','INFRASTR','PERSONAL')

Remember we need to reshape:

safeINS_long=melt(safeINS,id.vars = 'city')

Let’s redo our heat map:

base= ggplot(data = safeINS_long, 
             aes(x = reorder(variable,
                             value, 
                             median),
                 y =reorder(city,
                            value, 
                            median)))

base + geom_tile(aes(fill = value)) + 
       scale_fill_gradient(low = 'grey90',
                           high = "grey50") +
       theme(axis.text.x = element_text(angle = 90,
                                        hjust = 1,
                                        size = 8),
             axis.text.y = element_text(size = 6))

With this limited number of variables, you may try a radar plot:

base  = ggplot(safeINS_long, 
               aes(x = variable, 
                   y = value, 
                   group = city)) + #new
        geom_polygon(fill = 'gray',
                     col='orange') 

radar1 = base + coord_polar()

radar1 = radar1 + facet_wrap(~city,# one plot per city
                           ncol = 10) # ten plots per row
radar1

The radar plot describes how a case (here, a city) is doing in every dimension (we have four dimensions).

We could improve the plot by ordering the facets and increasing the font size of the dimension names (X):

radar1_re = radar1 + facet_wrap(~reorder(city,-value, median),ncol = 10)


radar1_re = radar1_re + theme(axis.text.x = element_text(size = 12)) 
radar1_re

We can also highlight the case names; let’s change the theme from above:

radar1_label = radar1_re + theme(legend.position="none",
                strip.text = element_text(size = 20)) #here!!!
radar1_label 

You could add extra customization if desired:

### arguments
newBackGroundGrid=element_rect(fill = "white",
                         colour = "red",
                         linewidth = 0.5, # 'size' is deprecated in recent ggplot2
                         linetype = "dashed")

newBackLineGrid=element_line(linewidth = 1,
                      linetype = 'solid',
                      colour = "lightblue")

### more customization
radar1_label+ theme(panel.background = newBackGroundGrid,
             panel.grid.major = newBackLineGrid)

The colors above are not the best choice; I used them only so you notice where to make changes. Keep in mind that areas are difficult to compare, so the plots above should be used with care (perhaps show only some of the cities).

The idea behind the radar plot can be represented by simpler plots:

ggplot(data=safeINS_long, 
       aes(x=reorder(city, value,median),
           y=value)) + theme_classic() +
    geom_point(shape=5)+
    geom_segment(aes(x=city,
                     y=0,
                     yend = value,
                     xend = city),
                 color='grey', linewidth=0.2) +
    facet_grid(~variable) + coord_flip() + 
    theme(axis.text.y = element_text(size = 5))

Finally, let me produce a plot using all the input variables, reducing their dimensionality by combining clustering and multidimensional scaling.

You can use multidimensional scaling to represent all the variables in a two dimensional plot. This will create a map that will allow you to visualize “neighborhoods of similarity” of the cases (cities). With clustering, you will use the variables (columns) to define homogeneous groups among the cases (rows).
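cmdscale() is base R and accepts any distance object. A minimal sketch with the built-in eurodist data (road distances between European cities) shows the idea: each city gets two coordinates that approximately preserve the original pairwise distances:

```r
# classical multidimensional scaling down to 2 dimensions
coords = cmdscale(eurodist, k = 2)

dim(coords)      # one row per city, one column per dimension
head(coords, 3)
```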

Let’s see:

  1. Get all the input columns:
allIN=safe[,c(grep("_In_", names(safe) ))]
names(allIN)
##  [1] "D_In_PrivacyPolicy"           "D_In_AwarenessDigitalThreats"
##  [3] "D_In_PubPrivPartnerships"     "D_In_TechnologyEmployed"     
##  [5] "D_In_CyberSecurity"           "H_In_EnvironmentPolicies"    
##  [7] "H_In_AccessHealthcare"        "H_In_Beds_1000"              
##  [9] "H_In_Doctors_1000"            "H_In_AccessFood"             
## [11] "H_In_QualityHealthServ"       "I_In_EnforceTransportSafety" 
## [13] "I_In_PedestrianFriendliness"  "I_In_QualityRoad"            
## [15] "I_In_QualityElectricity"      "I_In_DisasterManagement"     
## [17] "P_In_PoliceEngage"            "P_In_CommunityPatrol"        
## [19] "P_In_StreetCrimeData"         "P_In_TechForCrime"           
## [21] "P_In_PrivateSecurity"         "P_In_GunRegulation"          
## [23] "P_In_PoliticalStability"
  2. Add the city names:
allIN$city=safe$city
  3. Compute the distance among cases:
dist_in_safe=dist(allIN[,-24]) # drop column 24, 'city'
  4. Compute the clusters:

With so much data, it is useful to divide the cities into clusters. We could ask for 5 clusters:

library(cluster)
ResultFromPam= cluster::pam(x=dist_in_safe,
              k = 5, cluster.only = F)

#add to dataframe
allIN$cluster=ResultFromPam$clustering
  5. Get the map using multidimensional scaling:
theMap=cmdscale(dist_in_safe,k = 2)
head(theMap,10)
##           [,1]       [,2]
## 1   -5.2101475  41.537274
## 2  -88.8291715  -0.430404
## 3    4.7336749  53.260067
## 4   96.8490858 -22.620966
## 5  -83.2920752  -1.105904
## 6   -0.2189441   7.133945
## 7   60.5252561 -12.102868
## 8  -75.0830797  12.969968
## 9  -32.2039804  58.859438
## 10  83.7611273  10.429404
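The choice of k = 5 above was arbitrary. One common check is the average silhouette width that pam() reports (higher is better). A sketch on the built-in mtcars data, not the safety data, assuming the cluster package is available:

```r
library(cluster)

# scale first so no variable dominates the distance
d = dist(scale(mtcars))

# average silhouette width for several candidate values of k
sil = sapply(2:6, function(k) pam(d, k = k)$silinfo$avg.width)
names(sil) = 2:6
sil  # pick the k with the largest average width
```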

Add the two MDS dimensions to the data frame:

#add to dataframe
allIN$dim1=theMap[,1]
allIN$dim2=theMap[,2]

Let’s plot the cities using dim1 and dim2 as coordinates:

library(ggplot2)
base=ggplot(allIN,aes(x=dim1,y=dim2,
                       color=cluster,
                       label = city))
cityPoints=base + geom_point() 
 
cityPoints + geom_text()

Let’s use ggrepel to improve labelling:

library(ggrepel)

cityPointsText=cityPoints + theme_void() +
               geom_text_repel(size=1.5,
                               max.overlaps = 20) 

cityPointsText

Let’s use color as a factor:

baseFactorCluster=ggplot(allIN,aes(x=dim1,y=dim2,
                       color=as.factor(cluster),
                       label = city)) 
cityPoints=baseFactorCluster + geom_point()
cityPointsText=cityPoints + theme_void() +
               geom_text_repel(size=1.5,
                               max.overlaps = 20) 

 
cityPointsText

Would you like your own colors? …be careful!

cityPointsText+scale_color_manual(values = c("blue",'darkgreen','red','magenta','grey50'))